Take Home Exercise 3

Mini Case 3 of Vast Challenge 2023

Author

Oh Jia Wen

Published

June 4, 2023

Modified

June 4, 2023

1. OVERVIEW

1.1 The Task

2. Datasets

3. Data Preparation

3.1 Install R-packages

p_load() of the pacman package is used to install and load the following R packages:

pacman::p_load(jsonlite, tidygraph, ggraph, 
               visNetwork, graphlayouts, ggforce, 
               skimr, tidytext, tidyverse, DT)
options(scipen = 999)

3.2 Importing Data

fromJSON() of the jsonlite package is used to import the JSON file.

MC3_challenge <- fromJSON("data/MC3.json")
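Before extracting anything, it can be useful to confirm the shape of the imported object. A minimal check (assuming the standard MC3.json layout with top-level nodes and links elements) is:

```r
# Inspect the top-level structure of the imported list:
# it should contain the `nodes` and `links` components.
str(MC3_challenge, max.level = 1)

# Peek at the columns of each component
glimpse(MC3_challenge$nodes)
glimpse(MC3_challenge$links)
```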

Note that this is an undirected graph: there is no direction of flow along the edges, so directed = FALSE will be used when building the graph model later.

3.2.1 Extracting Edges

As the imported data file is a large list, we will extract the edges from MC3_challenge and save them as a tibble data frame called MC3_edges. The code below works as follows:

  • distinct() is used to remove duplicated records

  • mutate() and as.character() are used to convert field data type from list to character

  • group_by() and summarise() are used to count the number of unique links

  • filter(source != target) is used to remove self-loops, i.e. records where the source and target companies are identical

MC3_edges <- as_tibble(MC3_challenge$links) %>%
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n()) %>%
  filter(source != target) %>%
  ungroup()

3.2.2 Extracting Nodes

Similarly, we will extract the nodes from MC3_challenge and save them as a tibble data frame called MC3_nodes. The code below works as follows:

  • mutate() and as.character() are used to convert data type from list to character

  • as.numeric(as.character()) is used to convert revenue_omu from list to character and then to a numeric data type

  • select() is used to reorganize the sequence

MC3_nodes <- as_tibble(MC3_challenge$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)
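As a small illustration of why the double conversion of revenue_omu is needed (the toy tibble below is hypothetical, not from the challenge data): calling as.numeric() directly on a list column fails, so the column is first flattened to character.

```r
library(tibble)
library(dplyr)

# Hypothetical list-column, mimicking how fromJSON() can return revenue_omu:
# some elements are character scalars, some are empty (character(0))
toy <- tibble(revenue_omu = list("100.5", character(0), "250"))

# as.character() flattens the list; as.numeric() then parses it,
# turning unparseable elements (the empty ones) into NA with a warning
toy %>%
  mutate(revenue_omu = as.numeric(as.character(revenue_omu)))
```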

4. Data Exploration

In this section, we will explore the nodes and edges data frame to identify aspects for data wrangling.

4.1 Exploring the edges data frame

skim() of the skimr package is used to display the summary statistics of the MC3_edges tibble data frame. As observed, there are no missing values in any field.

skim(MC3_edges)
Data summary
Name MC3_edges
Number of rows 24036
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 6 700 0 12856 0
target 0 1 6 28 0 21265 0
type 0 1 16 16 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0 1 1 1 1 1 ▁▁▇▁▁

datatable() of the DT package is used to display the MC3_edges tibble data frame as an interactive table.

DT::datatable(MC3_edges)

4.1.1 Plotting bar chart

ggplot(data = MC3_edges,
       aes(x= type)) +
  geom_bar()

4.2 Exploring the nodes

skim() of the skimr package is used to display the summary statistics of the MC3_nodes tibble data frame. As observed, the character fields are complete, but revenue_omu has 21,515 missing values out of 27,622 rows (a complete rate of only 0.22).

skim(MC3_nodes)
Data summary
Name MC3_nodes
Number of rows 27622
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 6 64 0 22929 0
country 0 1 2 15 0 100 0
type 0 1 7 16 0 3 0
product_services 0 1 4 1737 0 3244 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue_omu 21515 0.22 1822155 18184433 3652.23 7676.36 16210.68 48327.66 310612303 ▇▁▁▁▁

datatable() of the DT package is used to display the MC3_nodes tibble data frame as an interactive table.

DT::datatable(MC3_nodes)

4.2.1 Plotting the bar chart

ggplot(data = MC3_nodes,
       aes(x=type)) +
  geom_bar()

5. Network Visualization and Analysis

5.1 Building network model with tidygraph

Before building the graph, a master node list is derived from the source and target fields of MC3_edges so that every node referenced by an edge is present. tbl_graph() of the tidygraph package is then used to build the undirected network model, and betweenness and closeness centrality measures are computed for each node.

id1 <- MC3_edges %>%
  select(source) %>%
  rename(id = source)
id2 <- MC3_edges %>%
  select(target) %>%
  rename(id = target)
MC3_nodes_master <- rbind(id1, id2) %>%
  distinct() %>%
  left_join(MC3_nodes,
            unmatched = "drop")
MC3_graph <- tbl_graph(nodes = MC3_nodes_master,
                       edges = MC3_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())
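To see which nodes dominate the network before plotting, the computed measures can be pulled back out of the graph object. A sketch, assuming the graph was built as above:

```r
# Extract the node table from the tbl_graph and rank by betweenness
MC3_graph %>%
  activate(nodes) %>%
  as_tibble() %>%
  arrange(desc(betweenness_centrality)) %>%
  select(id, type, betweenness_centrality, closeness_centrality) %>%
  head(10)
```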

MC3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  geom_edge_link(alpha = 0.5) +
  geom_node_point(aes(size = betweenness_centrality),
                  color = "lightblue",
                  alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  theme_graph()

6. Text Sensing with Tidytext

In this section, the tidytext package is used to perform basic text sensing on the product_services field of the nodes.

6.1 Simple Word count

str_count() of the stringr package is used to count the number of times the word "fish" appears in the product_services field of each node.

MC3_nodes %>%
  mutate(n_fish = str_count(product_services, "fish"))
# A tibble: 27,622 × 6
   id                          country type  revenue_omu product_services n_fish
   <chr>                       <chr>   <chr>       <dbl> <chr>             <int>
 1 Jones LLC                   ZH      Comp…  310612303. Automobiles           0
 2 Coleman, Hall and Lopez     ZH      Comp…  162734684. Passenger cars,…      0
 3 Aqua Advancements Sashimi … Oceanus Comp…  115004667. Holding firm wh…      0
 4 Makumba Ltd. Liability Co   Utopor… Comp…   90986413. Car service, ca…      0
 5 Taylor, Taylor and Farrell  ZH      Comp…   81466667. Fully electric …      0
 6 Harmon, Edwards and Bates   ZH      Comp…   75070435. Discount superm…      0
 7 Punjab s Marine conservati… Riodel… Comp…   72167572. Beef, pork, chi…      0
 8 Assam   Limited Liability … Utopor… Comp…   72162317. Power and Gas s…      0
 9 Ianira Starfish Sagl Import Rio Is… Comp…   68832979. Light commercia…      0
10 Moran, Lewis and Jimenez    ZH      Comp…   65592906. Automobiles, tr…      0
# ℹ 27,612 more rows
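The count above can be turned into a filter to keep only the nodes whose product_services mention fish. A sketch, reusing the n_fish column (fish_nodes is a hypothetical name introduced here for illustration):

```r
# Keep only nodes whose product_services contain "fish" at least once
fish_nodes <- MC3_nodes %>%
  mutate(n_fish = str_count(product_services, "fish")) %>%
  filter(n_fish > 0)

nrow(fish_nodes)  # number of nodes mentioning "fish"
```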

6.2 Tokenisation

Tokenisation refers to the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases, or entire sentences. unnest_tokens() of the tidytext package is used here; the unnested text extracted from product_services goes into the output column, word.

token_nodes <- MC3_nodes %>%
  unnest_tokens(word, 
                product_services)

6.2.1 Visualizing the Extracted words

token_nodes %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")

6.3 Removing stopwords

stopwords_removed <- token_nodes %>% 
  anti_join(stop_words)

stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")

Given that there are 7,750 unique words in the word column, we will focus on the words that are related to illegal fishing.

length(unique(stopwords_removed$word))
[1] 7750
#create custom stop words vector
custom_stopwords <- c("food")

clean_text <- stopwords_removed %>%
  filter(!word %in% custom_stopwords) %>%
  filter(!grepl("\\d", word)) %>%  #to remove numbers 
  filter(grepl("fish", word) | grepl("seafood", word ) | grepl("shrimp", word)) 


#  mutate(category = case_when(
#    grepl("products", word, ignore.case= TRUE) ~ "Products",
#    grepl("services", word, ignore.case = TRUE) ~ "Services",
#    TRUE ~ 'Others'))

clean_text %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Unique words",
       y = "Count",
       title = "Count of unique words found in product_services field")